In this project, I will work with two powerful Python packages, Pandas and Seaborn. Both packages have extensive online documentation: there is a thorough tutorial on visualization with Pandas, the Seaborn tutorial contains many examples of data visualization, and the matplotlib website has additional resources for learning to plot with Python tools.
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import tweepy
import requests
import os
import json
import time
%matplotlib inline
from requests import get
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')
df_archive = pd.read_csv('./twitter-archive-enhanced.csv')
df_archive.shape
df_archive.describe()
df_archive.info()
df_archive.tweet_id.nunique()
df_archive.isnull().sum()
# Download the tweet image predictions TSV using the Requests library
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
response
# Write the downloaded content to image_predictions.tsv
with open('image_predictions.tsv', 'wb') as file:
    file.write(response.content)
# Read the tweet image predictions TSV file into a DataFrame
predict_img_df = pd.read_csv('image_predictions.tsv', sep='\t')
# print statistics
# high-level overview of data shape and composition
print('Rows and columns: ' + str(predict_img_df.shape))
print('')
print(predict_img_df.dtypes)
predict_img_df.info()
predict_img_df.head()
twitter_archive_master = pd.read_csv('twitter_archive_master.csv')
print('Rows and columns: ' + str(twitter_archive_master.shape))
twitter_archive_master.info()
twitter_archive_master.describe()
# !pip install tweepy
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
# app login
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
# user login
access_token = 'HIDDEN'
access_secret = 'HIDDEN'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
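Since the real keys are hidden, a safer pattern is to keep them out of the notebook entirely. A minimal sketch, assuming the four environment variables have been exported in the shell beforehand (the variable names here are illustrative):
import os
# Read credentials from the environment instead of hard-coding them
consumer_key = os.environ.get('TWITTER_CONSUMER_KEY')
consumer_secret = os.environ.get('TWITTER_CONSUMER_SECRET')
access_token = os.environ.get('TWITTER_ACCESS_TOKEN')
access_secret = os.environ.get('TWITTER_ACCESS_SECRET')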
# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Iterate through all of the authenticated user's friends
# for friend in tweepy.Cursor(api.friends).items():
#     # Process the friend here
#     print(friend)
screen_name = "dog_rates"
statuses = api.user_timeline(screen_name)
print("retrieved: ", len(statuses))
status = statuses[0]
data_list = []
for status in statuses:
    data_list.append(status._json)
df = pd.DataFrame(data_list)
df
The Twitter API gives developers access to most of Twitter’s functionality. You can use it to read and write information about Twitter entities such as tweets, users, and trends. Here the API calls run without error.
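Below is a minimal sketch of how each tweet in the archive could be queried and its JSON saved line by line. It assumes api is the authenticated tweepy 3.x client created above; 'tweet_json.txt' is an illustrative file name.
# Sketch: fetch each archived tweet and append its JSON to a text file
with open('tweet_json.txt', 'w') as outfile:
    for tweet_id in df_archive.tweet_id:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(status._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print('Failed to fetch tweet {}: {}'.format(tweet_id, e))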
df.describe()
# List of dictionaries to read tweet's JSON data line by line and later convert to a DataFrame
df_list = []
with open('tweet-json copy', 'r') as json_file:
    for line in json_file:
        status = json.loads(line)
        # Append to list of dictionaries
        df_list.append({'tweet_id': status['id'],
                        'retweet_count': status['retweet_count'],
                        'favorite_count': status['favorite_count'],
                        'display_text_range': status['display_text_range']})
# Create a DataFrame with tweet ID, retweet count, favorite count and display_text_range
status_df = pd.DataFrame(df_list, columns = ['tweet_id', 'retweet_count', 'favorite_count', 'display_text_range'])
status_df
status_df.shape
status_df.info()
plt.figure(figsize=(20,10))
sns.heatmap(df_archive.isna(), cbar=False)
As the heatmap shows, a lot of data is missing in the reply and retweet status columns. Since we only want original tweets with images, we will delete those rows later; the missing values in the "expanded_urls" column will also disappear with that cleanup.
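A quick count backs up what the heatmap shows (a sketch using the column names seen above):
# Non-null counts for the reply/retweet status columns and expanded_urls
df_archive[['in_reply_to_status_id', 'in_reply_to_user_id',
            'retweeted_status_id', 'retweeted_status_user_id',
            'expanded_urls']].notnull().sum()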
for col in df_archive.select_dtypes('float64'):
    plt.figure()
    sns.distplot(df_archive[col])
for col in df_archive.select_dtypes('object'):
    print(f'{col :-<50} {df_archive[col].nunique()}')
df_archive['doggo'].value_counts()
df_archive['floofer'].value_counts()
df_archive['puppo'].value_counts()
df_archive['pupper'].value_counts()
The data is present for only 16% of the rows. Now let's take a look at the ratings. From what we have seen so far, the ratings always have the form 13/10, 12/10, and so on, so we would expect a numerator > 10 and a denominator = 10.
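To check that expectation directly, we can re-extract the ratings from the tweet text with a simple regex; the pattern below is an assumption about the rating format.
# Extract "numerator/denominator" pairs from the tweet text
extracted = df_archive['text'].str.extract(r'(\d+(?:\.\d+)?)/(\d+)', expand=True)
extracted.columns = ['num', 'den']
extracted['den'].value_counts().head()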
df_archive.nunique()
for col in df_archive.select_dtypes('object'):
    # Show the ten most frequent values per column
    print(f'{col :-<50} {df_archive[col].value_counts().head(10)}')
for col in df_archive.select_dtypes('int64'):
    plt.figure()
    sns.distplot(df_archive[col])
len(df_archive[df_archive.retweeted_status_id.isnull() == False])
# Checking if there are any records in df_archive whose corresponding record with the same tweet_id is missing in the predict_img_df table
len(df_archive[~df_archive.tweet_id.isin(predict_img_df.tweet_id)])
df_archive.name.value_counts().sort_index(ascending=False)
# Sort by rating_denominator values
df_archive.rating_denominator.value_counts().sort_index()
df_archive.name.value_counts()
print(df_archive.describe())
df_archive.info()
predict_img_df.info()
predict_img_df.columns
predict_img_df.info()
plt.figure(figsize=(8,6))
predict_img_df.dtypes.value_counts().plot.pie()
for col in predict_img_df.select_dtypes('float64'):
plt.figure()
sns.distplot(predict_img_df[col])
predict_img_df.describe()
The minimum and maximum of each prediction confidence are within the 0-1 range, as expected, and confidence decreases from the first to the third prediction.
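The mean confidence per prediction slot makes the decrease explicit:
# Average confidence should fall from p1 to p3
predict_img_df[['p1_conf', 'p2_conf', 'p3_conf']].mean()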
for col in predict_img_df.select_dtypes('object'):
    print(f'{col :-<50} {predict_img_df[col].nunique()}')
The best way to find duplicates is to look at the jpg_url column: value counts > 1 mean the data contains duplicates/retweets, and we only want tweets whose pictures contain dogs. Let's also see whether there are pictures for which the ML algorithm didn't predict any dog.
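A sketch of that duplicate check on the image URLs:
# Number of duplicated image URLs and the most repeated ones
print(predict_img_df['jpg_url'].duplicated().sum())
predict_img_df['jpg_url'].value_counts().head()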
for col in predict_img_df.select_dtypes('object'):
    print(f'{col :-<60} {predict_img_df[col].value_counts()}')
We saw that some entries have nothing to do with dogs: minibus, fountain, desktop_computer, ... But we can identify them with the "p1_dog" column.
predict_img_df.p1_dog.value_counts()
So, of the 2,750 pictures, 543 are identified as not dogs.
plt.figure(figsize=(20,10))
sns.heatmap(predict_img_df.isna(), cbar=False)
predict_img_df.head()
plt.figure(figsize=(20,10))
sns.heatmap(status_df.isna(), cbar=False)
status_df.head()
status_df.info()
for col in status_df.select_dtypes('object'):
    print(f'{col :-<50} {status_df[col].value_counts()}')
df_archive_clean = df_archive.copy()
predict_img_df_clean = predict_img_df.copy()
status_df_clean = status_df.copy()
Strip all html anchor tags (i.e. <a..>) in source column and retain just the text in between the tags. Convert the datatype from string to categorical.
df_archive_clean.source = df_archive_clean.source.str.replace(r'<(?:a\b[^>]*>|/a>)', '', regex=True)
df_archive_clean.source = df_archive_clean.source.astype('category')
df_archive_clean.source.value_counts()
df_archive_clean: In the name column, there are several values that are not dog names, like 'a', 'the', 'such', etc. We notice that all of these values are entirely lowercase, so we can use that to clean up this field.
mask = [not x.islower() if isinstance(x, str) and len(x) > 0 else False
        for x in df_archive_clean['name']]
df_archive_clean = df_archive_clean[mask]
# Checking for invalid names
unique_names = df_archive_clean['name'].unique()
print("<a> in names?", "a" in unique_names)
print("<the> in names?", "the" in unique_names)
unique_names
Change in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id and retweeted_status_user_id to uint64 and use 0 as not-available
Change timestamp and retweeted_status_timestamp to datetimes
Convert data types to integer but ignore null values
def convert_to_uint64(df, column):
    # Start from all zeros (0 represents "not available")
    tmp = pd.Series(0, index=df.index, dtype="uint64")
    non_na_mask = ~df[column].isna()
    tmp[non_na_mask] = df[column][non_na_mask].astype("uint64")
    return tmp
for column in ["in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id"]:
    df_archive_clean[column] = convert_to_uint64(df_archive_clean, column)
df_archive_clean.timestamp = pd.to_datetime(df_archive_clean.timestamp)
df_archive_clean.retweeted_status_timestamp = pd.to_datetime(df_archive_clean.retweeted_status_timestamp)
# Check data types are now correct
df_archive_clean.info()
Use tweet_id to fill in expanded_urls: if expanded_urls is null, construct the URL from the tweet id.
df_archive_clean['expanded_urls'].isna().sum()
def add_urls(row):
    # Don't do anything if expanded_urls is not null
    if pd.notnull(row['expanded_urls']):
        return row
    else:
        # Get tweet_id
        tweet_id = row['tweet_id']
        # Make new URL and save into expanded_urls
        row['expanded_urls'] = 'https://twitter.com/dog_rates/status/{}'.format(tweet_id)
        # Return updated row
        return row
# Save dataframe with missing urls
df_archive_clean = df_archive_clean.apply(add_urls, axis=1)
Check that there are no empty expanded_urls
print(df_archive_clean[df_archive_clean['expanded_urls'].isnull()])
print(df_archive_clean['expanded_urls'].isna().sum())
df_archive_clean['expanded_urls']
df_archive_clean: contains retweets and therefore duplicates
Keep only those rows in df_archive_clean table that are original tweets and NOT retweets (i.e. retweeted_status_id column is null). Delete the rest.
# Note: 0 represents missing ids
len(df_archive_clean[df_archive_clean.retweeted_status_id != 0])
df_archive_clean = df_archive_clean[df_archive_clean.retweeted_status_id==0]
len(df_archive_clean[df_archive_clean.retweeted_status_id != 0])
Remove the 343rd entry from the dataframe
# Drop 343rd entry (index=342 since starts at zero)
df_archive_clean = df_archive_clean.drop(df_archive_clean.index[342])
# Check entries around that point
df_archive_clean.iloc[340:343]
Find when text has dog stages in it and add it to the corresponding column
for col in ['pupper', 'doggo', 'puppo', 'floofer']:
    print(col)
    print(df_archive_clean[col].value_counts())
# Find when 'dog stage' is used in text column
def extend_dog_stages(row):
    # Only do something if <dog_stage> is in the text
    for dog_stage in ['pupper', 'doggo', 'puppo', 'floofer']:
        if dog_stage in row['text']:
            row[dog_stage] = dog_stage
    # Return row whether or not it was updated
    return row
df_archive_clean = df_archive_clean.apply(extend_dog_stages, axis=1)
for col in ['pupper', 'doggo', 'puppo', 'floofer']:
    print(col)
    print(df_archive_clean[col].value_counts())
Find entries that are retweets by matching the text pattern 'RT @...' and drop them from the table
# Find all the retweets
retweets = df_archive_clean.text.str.match('^RT @')
# Only keep the tweets that aren't retweets
df_archive_clean = df_archive_clean[~retweets]
# Test that we have no retweets
retweets = df_archive_clean.text.str.match('^RT @')
df_archive_clean[retweets]
Get the decimal ratings by matching the text pattern "#.#/#" and save as the new rating (only numerators had decimals)
print(list(df_archive_clean['rating_numerator'].unique()))
# Apply function to redo the rating if a decimal numerator is found
import re
def get_decimal_rating(row):
    # If the text contains a decimal numerator, update the rating
    decimal_numerator_pattern = r'(\d+\.\d+)/(\d+)'
    matches = re.search(decimal_numerator_pattern, row['text'])
    if matches is not None:
        row['rating_numerator'] = float(matches.group(1))
        row['rating_denominator'] = int(matches.group(2))
    return row
df_archive_clean = df_archive_clean.apply(get_decimal_rating, axis=1)
print(list(df_archive_clean['rating_numerator'].unique()))
# 13.5, 11.27, 11.26 are now present
Most ratings are expected to have a denominator of 10 and a numerator around 0...20.
We take a look at potential outliers and manually remove the incorrect tweets.
df_archive.rating_numerator.value_counts()
df_archive.rating_denominator.value_counts()
pd.set_option('display.max_colwidth', -1)
rating_ratio = df_archive_clean.rating_numerator / df_archive_clean.rating_denominator
mask = (rating_ratio < 0) | (rating_ratio > 2.0)
df_archive_clean[['rating_numerator', 'rating_denominator', 'text']][mask]
# Drop rows 313 and 516
df_archive_clean = df_archive_clean.drop([313, 516])
pd.set_option('display.max_colwidth', -1)
rating_ratio = df_archive_clean.rating_numerator / df_archive_clean.rating_denominator
mask = (rating_ratio < 0) | (rating_ratio > 2.0)
df_archive_clean[['rating_numerator', 'rating_denominator', 'text']][mask]
Keep only those records in the df_archive_clean table whose tweet_id exists in the predict_img_df table
df_archive_clean = df_archive_clean[df_archive_clean.tweet_id.isin(predict_img_df.tweet_id)]
len(df_archive_clean[~df_archive_clean.tweet_id.isin(predict_img_df.tweet_id)])
df_archive_clean.info()
Drop retweeted_status_id, retweeted_status_user_id and retweeted_status_timestamp columns from df_archive_clean table
df_archive_clean.drop(['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1, inplace=True)
df_archive_clean.info()
One variable in four columns (doggo, floofer, pupper, and puppo)
The doggo, floofer, pupper and puppo columns in the df_archive_clean table should be merged into one column named "dog_stages"
len(df_archive_clean[(df_archive_clean.doggo != 'None') & (df_archive_clean.floofer != 'None')])
len(df_archive_clean[(df_archive_clean.doggo != 'None') & (df_archive_clean.pupper != 'None')])
len(df_archive_clean[(df_archive_clean.doggo != 'None') & (df_archive_clean.puppo != 'None')])
len(df_archive_clean[(df_archive_clean.floofer != 'None') & (df_archive_clean.pupper != 'None')])
len(df_archive_clean[(df_archive_clean.floofer != 'None') & (df_archive_clean.puppo != 'None')])
len(df_archive_clean[(df_archive_clean.pupper != 'None') & (df_archive_clean.puppo != 'None')])
# Dog stages into one column!
df_tmp = df_archive_clean.copy()
df_tmp['dog_stages'] = None
for col in ['doggo', 'floofer', 'pupper', 'puppo']:
    mask = df_tmp[col] == col
    df_tmp.loc[mask, 'dog_stages'] = col
# Handle the multi-stage combinations explicitly
mask_doggo_floofer = (df_archive_clean.doggo != 'None') & (df_archive_clean.floofer != 'None')
df_tmp.loc[mask_doggo_floofer, 'dog_stages'] = 'doggo|floofer'
mask_doggo_pupper = (df_archive_clean.doggo != 'None') & (df_archive_clean.pupper != 'None')
df_tmp.loc[mask_doggo_pupper, 'dog_stages'] = 'doggo|pupper'
mask_doggo_puppo = (df_archive_clean.doggo != 'None') & (df_archive_clean.puppo != 'None')
df_tmp.loc[mask_doggo_puppo, 'dog_stages'] = 'doggo|puppo'
df_tmp['dog_stages'].value_counts()
df_archive_clean = df_tmp
# drop old columns
df_archive_clean = df_archive_clean.drop(['doggo', 'pupper', 'puppo', 'floofer'], axis=1)
df_archive_clean['dog_stages'].value_counts()
Replace the value 'None' with NaN (missing value)
#replace the value 'None' with NaN (missing value)
df_archive_clean = df_archive_clean.replace('None', np.nan)
df_archive_clean.isnull().sum()
'None' previously passed the test as a name but isn't one. Drop tweets with missing names (previously 'None', now NaN).
mask = ~df_archive_clean['name'].isna()
df_archive_clean = df_archive_clean[mask]
df_archive_clean.isnull().sum()
p1, p2, p3 have inconsistent capitalization (sometimes the first letter is capitalized)
Making all strings in p1, p2, and p3 lowercase
predict_img_df_clean['p1'] = predict_img_df_clean['p1'].str.lower()
predict_img_df_clean['p2'] = predict_img_df_clean['p2'].str.lower()
predict_img_df_clean['p3'] = predict_img_df_clean['p3'].str.lower()
predict_img_df_clean.head()
Creating a new dataframe by merging the image predictions into a copy of the Twitter archive, keeping a prediction only if its tweet is already found in the archive
# Only keep predictions that have ids in archive
# (final size is as large as archive)
df1 = df_archive_clean
df2 = predict_img_df_clean
final_df_clean = pd.merge(df1, df2, how='left', on=['tweet_id'])
final_df_clean.info()
# If tweet id (from image predictions) not in archive,
# then it isn't in final merged dataframe
tweet_ids = predict_img_df_clean.tweet_id
for tweet_id in tweet_ids:
    # Test if id is also in archive
    if tweet_id not in df_archive_clean.tweet_id.values:
        # Check that it also isn't in new df
        if tweet_id in final_df_clean.tweet_id.values:
            print('Paradox! ID#{} in final DF but shouldn\'t be'.format(tweet_id))
status_df_clean.info()
Merge the tweet info into the final dataframe, keeping a tweet's info only if the tweet is already found in the final dataframe (the Twitter archive)
Note that 9 tweets are missing because they have been deleted.
# Only keep tweet info rows that have ids in the archive
# (final size is as large as the archive; status_df_clean already uses tweet_id)
df1 = final_df_clean
df2 = status_df_clean
final_df_clean = pd.merge(df1, df2, how='left', on=['tweet_id'])
# If tweet id (from tweet info) not in archive,
# then it isn't in final merged dataframe
tweet_ids = status_df_clean.tweet_id
for tweet_id in tweet_ids:
    # Test if id is in the archive
    if tweet_id not in df_archive_clean.tweet_id.values:
        # Check that it also isn't in new df
        if tweet_id in final_df_clean.tweet_id.values:
            print('ID#{} in final DF but shouldn\'t be'.format(tweet_id))
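One way to see those deleted tweets in the merged result is to count rows whose engagement counts came back empty (a sketch; the left merge leaves NaN where no tweet info was fetched):
# Archive tweets with no fetched info show up as NaN after the left merge
final_df_clean['retweet_count'].isna().sum()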
final_df_clean.info()
Storing our final, clean dataframe in the CSV file twitter_archive_master.csv
final_df_clean.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)
final_df_clean = pd.read_csv('twitter_archive_master.csv')
final_df_clean['source'].value_counts()
WeRateDogs has posted 98% of the tweets from an iPhone.
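The percentage can be checked directly from the normalized value counts:
# Share of tweets per source, in percent
(final_df_clean['source'].value_counts(normalize=True) * 100).round(1)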
final_df_clean['rating_numerator'].value_counts().sort_index()
final_df_clean['rating_numerator'][final_df_clean['rating_numerator'] > 10].value_counts().sum()
Out of a total of 1349 dogs rated by WeRateDogs, 824 were rated above 10/10, which is almost 61%. However, only one received the highest rating: 1776.
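The same numbers can be reproduced directly:
# Count and share of ratings above 10/10
above_10 = (final_df_clean['rating_numerator'] > 10).sum()
total = final_df_clean['rating_numerator'].count()
print('{} of {} rated above 10/10 ({:.0f}%)'.format(above_10, total, 100 * above_10 / total))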
Motivation: Which are the most common breeds? Are some dog breeds more common in tweets?
# Get attributes before determining dog breed
attributes = ['retweet_count', 'favorite_count', 'rating_denominator', 'rating_numerator']
df_dog_breeds = final_df_clean[attributes].copy()
# Create rating column
def percent_rating(row):
    if row['rating_denominator'] == 0:
        return 0
    return row['rating_numerator'] / row['rating_denominator']
df_dog_breeds['rating'] = df_dog_breeds.apply(percent_rating, axis=1)
# For each entry, check which is the most likely breed
def best_breed_match(row):
    # Defaults to compare against
    breed = 'not_dog'
    confidence = 0
    # Only keep a prediction if it is a possible dog
    dog_preds = [row[['p1', 'p1_dog', 'p1_conf']],
                 row[['p2', 'p2_dog', 'p2_conf']],
                 row[['p3', 'p3_dog', 'p3_conf']]]
    # Use these for easy reference
    index_breed, index_isDog, index_conf = 0, 1, 2
    for pred in dog_preds:
        # If it's a dog breed, check if it's the max confidence seen
        if pred.iloc[index_isDog]:
            # Save breed and confidence if confidence is higher
            if pred.iloc[index_conf] >= confidence:
                breed = pred.iloc[index_breed]
                confidence = pred.iloc[index_conf]
    # Update breed columns
    row['breed'] = breed
    row['breed_conf'] = confidence
    return row
# Build a frame with the prediction columns and apply the matcher
df_possible_breeds = final_df_clean[['p1', 'p1_dog', 'p1_conf',
                                     'p2', 'p2_dog', 'p2_conf',
                                     'p3', 'p3_dog', 'p3_conf']].copy()
df_possible_breeds = df_possible_breeds.apply(best_breed_match, axis=1)
# Save info into dog breed dataframe
df_dog_breeds['breed'] = df_possible_breeds['breed']
df_dog_breeds['breed_conf'] = df_possible_breeds['breed_conf']
# show 30 most common dog breeds
df_dog_breeds.breed.value_counts()[:30] # 'not_dog' is most common so keep this out of plot
# Plot the 12 most common dog breeds in a bar chart
plt.figure(figsize=(15,10))
only_dogs = df_dog_breeds['breed'] != 'not_dog'
df_dog_breeds[only_dogs].breed.value_counts()[:12][::-1].plot(kind='barh')
We first look at the top 30 most common dog breeds. Nearly 500 tweets were classified as not containing a dog, about 3 times more than the most common real breed. It could be that many tweets in this rating archive are not dogs, but that seems strange and unlikely. More plausibly, the image prediction model misclassified many images as not being dogs, especially considering that the data comes from a prediction model that may have made significant errors.
We then plotted the 12 most common (real) dog breeds in a bar chart. According to this data, the most common breed in the tweets was the golden retriever, with more than 150 tweets. The next four most common breeds, in descending order, were Labrador retriever, Pembroke (corgi), Chihuahua, and pug. The top three breeds fit well with my general use of the Internet and my experience viewing different images of dogs.
However, we must again be careful not to draw too many conclusions from this data, as we rely on the image prediction model to tell us the breed. Some breeds may have been harder for the model to identify and should in fact rank higher. In future work, the accuracy of the image predictions could be tested further to ensure the data reflects the true distribution of breeds.
plt.figure(figsize=(15,10))
final_df_clean['source'].value_counts().plot(kind='barh', figsize=(11,5), title='Most used Twitter source').set_xlabel("Number of Tweets")
plt.savefig('twitter_source')
plt.figure(figsize=(15,10))
final_df_clean.name.value_counts()[1:7].plot(kind='barh', figsize=(11,5), title='Top 6 common dog names').set_xlabel("Number of Dogs")
plt.savefig('dog_names')
final_df_clean.dog_stages.value_counts()
dog_counts = final_df_clean.dog_stages.value_counts()
fig,ax = plt.subplots(figsize = (15,10))
ax.bar(dog_counts.index, dog_counts.values, width = 0.8)
ax.set_ylabel('Dog Count')
ax.set_xlabel('Category')
plt.title("Most Common Dog Category")
plt.show()
# lmplot creates its own figure, so no separate plt.figure() call is needed
sns.lmplot(x='rating_numerator', y='rating_denominator',
           data=final_df_clean,
           hue='source',
           palette='Set2', fit_reg=False)
retweet_count = final_df_clean.retweet_count
print("The median and mean retweet count are: ", retweet_count.median(), "and ", retweet_count.mean(), "respectively." )
favorite_count = final_df_clean.favorite_count
print("The median and mean favorite count are: ", favorite_count.median(), "and ", favorite_count.mean(), "respectively." )
rating = df_dog_breeds.rating
print("The median and mean rating are: ", rating.median(), "and ", rating.mean(), "respectively." )